fix(service): recover stalled WiFi/TCP handshakes by cycling active transport#5856
Conversation
|
@jamesarich One thing worth noting from my WiFi testing, this app side fix may expose a firmware crash on some affected builds. On a tbeam running firmware 2.7.25.104df5f, the app side fix made the connection flow behave correctly and retry/reconnect cleanly. Once that happened, the node started rebooting consistently during the normal config/node-list load. Before this fix, the app could sometimes look connected or get stuck in a partial state, so it may not have hit the crashing firmware path as reliably. It looks like the app is now reaching the proper handshake path consistently and that path exposes a firmware-side crash. I have a patch in progress around As far as this PR goes the debug build is behaving well with the patched firmware, so I think it's ready for review. |
jamesarich
left a comment
There was a problem hiding this comment.
Thanks for the thorough work here, and especially for the field testing + the firmware-crash writeup — that investigation is genuinely valuable. The core fix is correct: the WiFi/TCP split-brain (transport Connected while the app is Disconnected, blocking same-node reconnect) is the right diagnosis, and cycling the transport in place is the right shape. Test coverage is strong.
Two things before this is ready to merge:
1. Scope — this PR has outgrown its title. It's now +2328/−120 across 26 files / 20 commits, several unrelated to handshake recovery. Please split into their own PR(s):
fix(data): make device link catalog refresh transactionalfix(data): separate refresh timeouts from Room persistence(DeviceHardwareRepository / DeviceLink / StaleWhileRevalidate)- the
isWifiUnavailable/ VPN-banner work (3 commits) — a distinct bug from handshake recovery
Also: squash the restart transport after post-handshake failure → narrow post-handshake recovery boundary pair, and a rebase would drop the 6 Merge branch 'main' commits. A core-connectivity change is far easier to review for regressions when it's just the recovery fix.
2. Recovery hardening (ties to your firmware finding). I dug into why the clean handshake trips the T-Beam reboot: firmware handleStartConfig() (build v2.7.25.104df5f) re-enters on every want_config with no in-flight guard and never resets its config_state sub-iterator. Android's stall-guard re-sends want_config on a timer and re-handshakes with no backoff/cap — that re-send is what re-enters the fragile firmware path mid node-list load. (The firmware write-dedup is single-slot memcmp vs the previous write only, so the interleaved heartbeats let the re-sent nonce slip past it — the "firmware will drop our retry" assumption doesn't hold.) Your firmware patch is the proper fix; a small app-side backoff + give-up cap is good defensive depth so a crashing node isn't re-driven every cycle. Details inline.
Everything else is minor (inline). Nice work overall — this is close.
|
@jamesarich Thanks - I split the independent scope into two separate PR branches and kept this PR focused on the core handshake recovery work. New split branches:
I also addressed the recovery-hardening feedback locally on the core branch:
Once the two smaller PRs land, I’ll rebase this PR onto current |
When a handshake stalls and recovery is needed, the connection manager must be able to cycle the active transport so a fresh Connected signal can re-enter the handshake flow. Previously there was no way to restart the transport without a full disconnect/reconnect cycle, leaving the transport Connected while the app reported Disconnected — a TCP split-brain that blocked same-node reconnect because the radio service still saw the same selected address and an already-running transport. restartTransport is an in-place stop/start of the currently active transport: - Preserves the selected device address and connection intent - Emits a transient DeviceSleep transition so observers see a real state change - Keeps recovery quiet from the user (no permanent Disconnected, no error) - Skips restart on explicit disconnect, device deselection, missing address, no running transport, or when another restart is already in progress - Coordinates with in-flight liveness recovery via an isRestarting CAS guard
When a handshake stall is detected, request a transport restart via runSiblingHandshakeRecovery instead of only changing app-level state. The sibling coroutine is parented to scope (not handshakeTimeout) so the restart sequence survives cancellation of the expired watchdog job. Recovery hardening: - Exponential backoff (2s base, doubling, 30s cap) between recovery attempts - After 3 consecutive failures, surface a localized sticky error and stop retrying so a crashing node is not re-driven indefinitely. Counter resets on successful handshake or sticky-error surface. - Removed mid-session want_config re-send that re-entered firmware's fragile handleStartConfig() path (root cause of T-Beam reboots under retry). Both BLE and fast transports now recover via clean transport restart. - One-way handshakeCompleteLatch prevents late-arriving FileInfo/config packets from re-arming the watchdog after NODE_INFO_NONCE - All handshakeTimeout re-arm sites use atomic getAndSet swap to prevent orphaned jobs between cancel and reassign - SafeCatchingAll wraps the localized error resolution so headless JVM tests without Skiko do not silently swallow setErrorMessage Concurrent recovery is guarded by three layers: connectionMutex serializes the Disconnected transition, the fromState=Connecting check rejects duplicate siblings, and the transport-level isRestarting CAS is authoritative.
…ures Emit onHandshakeProgress from meaningful Stage 1 packet handlers (MyNodeInfo, local metadata, file info, device config, module config, channels, DeviceUI config, early NodeInfo) and Stage 2 handlers (NodeInfo packets) so the connection manager can reset the fast watchdog on real progress. Buffer early NodeInfo received during Stage 1 so it is not lost before the node-list phase begins. Stage 2 completion remains gated on NODE_INFO_NONCE. The watchdog is cancelled synchronously when NODE_INFO_NONCE arrives, before async NodeDB install work begins. If NodeDB install fails after the firmware handshake completed, recover via recoverPostHandshakeFailure which routes through the same disconnect-before- restart path as a stalled handshake. Only pre-Connected failures trigger recovery; post-Connected analytics/side-effect failures are logged without forcing a restart.
Add a transient RECONNECTING display state while silent handshake recovery is in progress so the user sees activity rather than a frozen Connecting spinner. The state maps from Connecting + RECONNECTING_PROGRESS_TEXT and does not produce a user-facing disconnect or error. Add a localized Res.string.error_recovery_exhausted for the sticky error surfaced when recovery gives up after repeated failures. Resolved via getStringSuspend in the connection manager's recovery coroutine. Extract RECONNECTING_PROGRESS_TEXT as a shared constant in ServiceRepository so both the connection manager (writer) and the ConnectionsViewModel (reader) reference the same contract.
6c49ec5 to
221fff8
Compare
|
@jamesarich Resolved. The scope-related changes were split out and landed separately (#5881 and #5882), the merge/dead commits were removed during the rebase, and the remaining branch is now focused on the connection recovery work. The recovery hardening feedback has also been incorporated: recovery now uses transport restart with exponential backoff and a give-up cap, the same-session I also addressed the remaining inline review comments during the cleanup and rebase. Please let me know if I missed anything. Testing is continuing to go well for the firmware-side fix. I'm preparing the associated firmware PR and will link it once it's up. |
|
Quick follow-up: I opened the firmware-side PR for the crash that came up while testing this: That PR handles the |
Overview
This PR improves WiFi/TCP handshake-stall recovery so the app and underlying transport do not get stuck in different connection states.
Previously, if the handshake stalled after its retry window, recovery could leave the app-level connection state and underlying transport out of sync. For TCP, the transport could remain active while the UI reported a disconnected state. In that situation, selecting the same node again could be ignored because the radio service still saw the same selected address and an already-running transport.
This PR adds a targeted transport restart path for stalled handshakes. When recovery is needed, the connection manager first moves app state to
Disconnected, then asks the radio service to restart the active transport. The fresh transportConnectedsignal can then re-enter the normal handshake flow.Follow-up WiFi testing also showed that some failed TCP/USB handshake attempts could sit in a loading state until the lower-level TCP timeout recovered them. This PR adds a TCP/USB-only fast handshake watchdog so those stalled handshakes recover sooner, while leaving BLE's longer handshake timing unchanged.
Additional BLE testing exposed a separate post-handshake stuck-connecting path. Once firmware sends
NODE_INFO_NONCE, the handshake watchdog is intentionally cancelled before local NodeDB install work begins. If that local NodeDB install fails, the app can otherwise remain inConnectingwith no active watchdog. This PR now recovers from NodeDB install failure by routing through the same reconnecting transport-restart path.This PR also adds defensive recovery hardening so a crashing or repeatedly failing node is not re-driven indefinitely. Recovery now uses clean transport restart only, applies exponential backoff between repeated attempts, and surfaces a sticky error after repeated unsuccessful recovery cycles.
Addresses: #3727
Key Changes
Handshake Recovery
RadioInterfaceService.restartTransport()for silent recovery of stalled mesh handshakes.MeshConnectionManagerImplso retry-exceeded handshake stalls request a transport restart instead of only changing app-level state.MeshConnectionManager.recoverPostHandshakeFailure()so NodeDB install failures after firmware handshake completion use the same app-level disconnect-before-restart ordering.NODE_INFO_NONCEarrives, before async NodeDB install work begins.ConnectedNodeDB install failure; post-Connectedanalytics / side-effect failures are logged without forcing recovery.Disconnectedtransition happens before transport restart emissions.Recovery Hardening
want_configretry path.Transport Restart Behavior
SharedRadioInterfaceService.restartTransport()as an in-place stop/start of the currently active transport.DeviceSleeptransport transition before restarting so observers can see a real state change before the freshConnectedevent.Disconnectedevent and avoiding a user-facing error.Handshake Progress
NODE_INFO_NONCE.Connection UI
Reconnecting…state while silent handshake recovery is in progress.Test Coverage
Adds
MeshConnectionManagerImplTestcoverage for:Adds
MeshConfigFlowManagerImplTest/MeshConfigHandlerImplTestcoverage for:NODE_INFO_NONCEarrives,Connected.Adds
SharedRadioInterfaceServiceLivenessTestcoverage for:Disconnected,Adds Connections UI/ViewModel coverage for:
Reconnecting…mapping.Follow-up
This PR addresses the Android-side handshake-stall recovery path associated with the reported WiFi/TCP reconnect behavior.
During testing, this also uncovered a firmware-side crash triggered by repeated or re-entered config/node-list startup on firmware build
tbeam / v2.7.25.104df5f. The firmware-side fix is being tracked separately in meshtastic/firmware#10754. This PR adds app-side defensive depth so Android does not keep re-driving a node that repeatedly fails recovery.If field logs still show periodic TCP reconnect or configuration loops after this lands, a follow-up investigation should look at TCP idle-timeout behavior and add privacy-safe disconnect-reason logging for EOF, IOException, write failure, and idle timeout.
I mostly run nRF52 devices over BLE, so I do not have a strong baseline for long-term WiFi behavior. I tested these changes with WiFi-connected ESP32 devices while reconnecting, scanning, and disabling/re-enabling WiFi. I also tested the post-handshake recovery and recovery-hardening changes on my BLE device after the T1000-E stuck-connecting behavior surfaced during review.